The use of virtual data for enhancing the collaboration between large groups of scientists is explored in several 
ways: 

• by defining "virtual" parameter spaces which can be searched and shared in an organized way by a collaboration 
of scientists in the course of their analysis 

• by providing a mechanism to log the provenance of results and the ability to trace them back to the various 
stages in the analysis of real or simulated data 

• by creating "check points" in the course of an analysis to permit collaborators to explore their own analysis 
branches by refining selections, improving the signal to background ratio, varying the estimation of parameters, 
etc. 

• by facilitating the audit of an analysis and the reproduction of its results by a different group, or in a peer 
review context. 

We describe a prototype for the analysis of data from the CMS experiment based on the virtual data system 
Chimera and the object-oriented data analysis framework ROOT. The Chimera system is used to chain together 
several steps in the analysis process including the Monte Carlo generation of data, the simulation of detector 
response, the reconstruction of physics objects and their subsequent analysis, histogramming and visualization 
using the ROOT framework. 



1. INTRODUCTION 

A look-up in the Webster dictionary gives: 
virtual 
Function: adjective 

Etymology: Middle English, possessed of certain 
physical virtues, from Medieval Latin virtualis, from 
Latin virtus strength, virtue. 

In this contribution we explore the virtue of vir- 
tual data in the scientific analysis process, taking 
as an example the coming generation of high energy 
physics (HEP) experiments at the Large Hadron Col- 
lider (LHC), under construction at CERN close to 
Geneva. 

Most data in contemporary science are the product
of increasingly complex computations and procedures
applied to the raw information coming from detectors
(the "measurements") or from numeric simulations,
e.g. reconstruction, calibration, selection, noise reduction,
filtering, estimation of parameters etc. High
energy physics and many other sciences are increasingly
CPU and data intensive. In fact, many new
problems can only be addressed at the high data volume
frontier. In this context, not only data analysis
transformations, but also the detailed log of how
those transformations were applied, become a vital
intellectual resource of the scientific community. The
collaborative processes of these ever-larger groups require
new approaches and tools enabling the efficient
sharing of knowledge and data across a geographically
distributed and diverse environment.

The scientific analysis process demands the precise 
tracking of how data products are to be derived, in 
order to be able to create and/or recreate them on de- 
mand. In this context virtual data are data products 
with a well defined method of production or repro- 
duction. The concept of "virtuality" with respect to 
existence means that we can define data products that 
may be produced in the future, as well as record the 
"history" of products that exist now or have existed 
at some point in the past. Recording and discovering 
the relationships can be important for many reasons - 
some of them, adapted from [1] to high energy physics
applications, are given below: 

• "I have found some interesting data, but I need 
to know exactly what corrections were applied 
before I can trust it." 

• "I have detected a muon calibration error and 
want to know which derived data products need 
to be recomputed."

• "I want to search a huge database for rare elec- 
tron events. If a program that does this analysis 
exists, I will not have to reinvent the wheel." 

• "I want to apply a forward jet analysis to 100M
events. If the results already exist, I will save 
weeks of computation." 

We need a "virtual data management" tool that can 
"re-materialize" data products that were deleted, gen- 
erate data products that were defined but never cre- 
ated, regenerate data when data dependencies or al- 
gorithms change, and/or create replicas at remote lo- 
cations when recreation is more efficient than data 
transfer. 

The virtual data paradigm records data provenance
by tracking how new data is derived from transformations
on other data. It focuses on two central
concepts: transformations and derivations. A trans- 
formation is a computational procedure used to de- 
rive data. A derivation is an invocation of such a 
procedure, resulting in the instantiation of a potential 
data product. Data provenance is the exact history 
of any existing (or virtual) data product. Often the 
data products are large datasets, and the management 
of dataset transformations is critical to the scientific 
analysis process. 
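The two concepts can be made concrete with a small sketch. The Python below is illustrative only and does not reproduce the Chimera schema: the class names, their fields, the `provenance` helper and the file names are all hypothetical, and data products are represented by plain logical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """A computational procedure, e.g. an executable (hypothetical model)."""
    name: str


@dataclass(frozen=True)
class Derivation:
    """One invocation of a transformation on named inputs and outputs."""
    name: str
    transformation: Transformation
    inputs: tuple
    outputs: tuple


def provenance(product, derivations):
    """Return the derivations that lead to `product`, earliest first."""
    producer = {out: dv for dv in derivations for out in dv.outputs}
    history, seen = [], set()

    def visit(item):
        dv = producer.get(item)
        if dv is None or dv.name in seen:
            return  # raw input, or derivation already recorded
        seen.add(dv.name)
        for parent in dv.inputs:
            visit(parent)
        history.append(dv)

    visit(product)
    return history


pythia = Transformation("pythia")
cmsim = Transformation("cmsim")
x1 = Derivation("x1", pythia, ("file1",), ("file2",))
x2 = Derivation("x2", cmsim, ("file2", "cardfile"), ("file3",))

print([dv.name for dv in provenance("file3", [x2, x1])])
```

The same records serve both readings of a derivation: as a history of an existing product, or as a recipe for a product that has not been materialized yet.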

From the scientist's point of view, data trackabil- 
ity and result auditability are crucial, as the repro- 
ducibility of results is fundamental to the nature of 
science. To support this need we require and envision 
something like a "virtual logbook" that provides the 
following capabilities: 

• easy sharing of tools and data to facilitate col- 
laboration - all data comes complete with a 
"recipe" on how to produce or reproduce it; 

• individuals can discover in a fast and well de- 
fined way other scientists' work and build from 
it; 

• different teams can work in a modular, semi- 
autonomous fashion; they can reuse previous 
data/code/results or entire analysis chains; 

• the often tedious procedures of repair and cor- 
rection of data can be automated using a 
paradigm similar to that which "make" imple- 
ments for rebuilding application code; 

• on a higher level, systems can be designed 
for workflow management and performance op- 
timization, including the tedious processes of 
staging in data from a remote site or recreating 
it locally on demand (transparency with respect 
to location and existence of the data).
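The "make"-like repair step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Chimera implementation: the `DERIVATIONS` table, the file names and the `stale_products` function are hypothetical. Given one changed input (for example a corrected calibration file), it lists every derived product that would have to be recomputed.

```python
from collections import deque

# Hypothetical derivation records: name -> (input files, output files).
DERIVATIONS = {
    "reco":   (["raw.dat", "muon_calib.db"], ["reco.root"]),
    "select": (["reco.root"], ["muons.root"]),
    "plots":  (["muons.root"], ["mass.hist"]),
    "tracks": (["raw.dat"], ["tracks.root"]),
}


def stale_products(changed, derivations):
    """Return all outputs that depend, directly or not, on `changed`."""
    consumers = {}
    for inputs, outputs in derivations.values():
        for item in inputs:
            consumers.setdefault(item, []).extend(outputs)

    stale, queue = [], deque([changed])
    while queue:
        for product in consumers.get(queue.popleft(), []):
            if product not in stale:
                stale.append(product)
                queue.append(product)
    return stale


print(stale_products("muon_calib.db", DERIVATIONS))
```

Products that do not depend on the changed input (here `tracks.root`) are left alone, which is the saving that a make-style rebuild is meant to provide.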



2. CHIMERA - THE GRIPHYN VIRTUAL 
DATA SYSTEM 

To experiment with and explore the benefits of data 
derivation tracking and virtual data management, a 
virtual data system called Chimera [1] is under active
development in the GriPhyN project [2]. A persistent
virtual data catalog (VDC), based on a relational vir- 
tual data schema, provides a compact and expressive 
representation of the computational procedures used 
to derive data, as well as invocations of those proce- 
dures and the datasets produced by those invocations. 

Applications access Chimera via a virtual data lan- 
guage (VDL), which supports both data definition 
statements, used for populating a Chimera database 
and for deleting and updating virtual data definitions, 
and query statements, used to retrieve information 
from the database. The VDL has two formats: a 
textual form that can be used for manual VDL com- 
position, and an XML form for machine-to-machine 
component integration. 

Chimera VDL processing commands implement re- 
quests for constructing and querying database entries 
in the VDC. These commands are implemented in
Java and can be invoked from the Java API or from
the command line.

The Chimera virtual data language describes data 
transformation using a function-call-like paradigm. It 
defines a set of relations to capture and formalize de- 
scriptions of how a program can be invoked, and to 
record its potential and/or actual invocations. The 
main entities of this language are described below: 

• A transformation is an executable program. As- 
sociated with a transformation is an abstract 
description of how the program is invoked (e.g.
executable name, location, arguments, environment).
It is similar to a "function declaration"
in C/C++. A transformation is identified by
the tuple [namespace]::identifier:[version #].

• A derivation represents an execution of a transformation.
It is an invocation of a transformation
with specific arguments, so it is similar to
a "function call" in C/C++. Associated with
a derivation is the name of the corresponding
transformation, the names of the data objects to
which the transformation is applied and other
derivation-specific information (e.g. values for
parameters, execution time). The derivation can
be a record of how data products came into existence
or a recipe for creating them at some point
in the future. A derivation is identified by the
tuple [namespace]::identifier:[version range].

• A data object is a named entity that may be
consumed or produced by a derivation. In the
current version, a data object is a logical file,
named by a logical file name (LFN). A separate
replica catalog (RC) or replica location service
(RLS) is used to map from logical file names
to physical location(s) for replicas. Associated
with a data object is simple metadata information
about that object.

Virtual Data Language

"Function declarations":

TR pythia( out a2, in a1, none param="160.0" )
{
  argument arg = ${param};
  argument file = ${a1};
  argument file = ${a2};
}

TR cmsim( out a2, in a1[] )
{
  argument files = ${a1};
  argument file = ${a2};
}

"Function calls":

DV x1->pythia( a2=@{out:file2}, a1=@{in:file1} );

DV x2->cmsim( a2=@{out:file3}, a1=[@{in:file2},
              @{in:cardfile}] );

Resulting chain: file1 -> x1 -> file2; (file2, cardfile) -> x2 -> file3

Figure 1: Example of a VDL description for a pipeline with two steps.
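The naming tuples given above for transformations and derivations could be split into their parts as sketched below. The text does not specify the exact grammar, so the regular expression is an assumption: it reads the square brackets as "optional" and forbids colons inside each part. The function name `parse_name` and the namespace `cms` are hypothetical.

```python
import re

# Assumed pattern: optional "namespace::", an identifier, optional ":version".
NAME = re.compile(r"^(?:(?P<namespace>[^:]+)::)?(?P<identifier>[^:]+)"
                  r"(?::(?P<version>[^:]+))?$")


def parse_name(text):
    """Split a qualified name into (namespace, identifier, version)."""
    match = NAME.match(text)
    if match is None:
        raise ValueError(f"not a valid name: {text!r}")
    return match.group("namespace", "identifier", "version")


print(parse_name("cms::pythia:1.0"))
print(parse_name("pythia"))
```

Missing parts come back as `None`, so a caller can decide how to apply default namespaces or version ranges.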

An example of the virtual data language description
for a simple pipeline with two steps is shown
in Figure 1. We define two transformations, called
PYTHIA and CMSIM, which correspond to the generation
of high energy interactions with the Monte
Carlo program PYTHIA [3] and the modeling of the
detector response in the CMS experiment [4, 5, 6].
Then we define two invocations of these transformations
where the formal parameters are replaced by
actual parameters. The virtual data system detects
the dependency between the output and input files of
the different steps (here file2) and automatically produces
the whole chain. This is a simple chain; the
virtual data system (VDS) has been tested successfully
on much more complex pipelines with hundreds
of derivations [7].
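The dependency detection described here can be illustrated with a short sketch. It is not the Chimera algorithm: it simply links two derivations whenever an output file name of one appears among the inputs of the other, and then asks the standard library module `graphlib` (Python 3.9 or later) for a valid execution order. The dictionary layout and the function name `execution_order` are assumptions; the derivation and file names are those of Figure 1.

```python
from graphlib import TopologicalSorter

# Derivations from Figure 1: name -> (inputs, outputs), as logical file names.
derivations = {
    "x2": (["file2", "cardfile"], ["file3"]),
    "x1": (["file1"], ["file2"]),
}


def execution_order(derivations):
    """Order derivations so that each runs after the producers of its inputs."""
    producer = {out: name
                for name, (_, outputs) in derivations.items()
                for out in outputs}
    graph = {name: {producer[i] for i in inputs if i in producer}
             for name, (inputs, _) in derivations.items()}
    return list(TopologicalSorter(graph).static_order())


print(execution_order(derivations))
```

Because `file2` is produced by `x1` and consumed by `x2`, `x1` is scheduled first even though it is listed second.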

The Chimera system supports queries which return
a representation of the tasks to be executed as a directed
acyclic graph (DAG). When executed on a Data
Grid, this DAG creates the specified data product. The steps of
the virtual data request formulation, planning and execution
process are shown schematically in Figure 2.



Chimera is integrated with other grid services to en- 
able the creation of new data by executing compu- 
tational schedules from database queries and the dis- 
tributed management of the resulting data. 
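The passage from an abstract plan with logical names to a concrete one can be sketched roughly as follows. This is a simplified illustration and not the Chimera planner: the plan layout, the replica catalog contents, the host names and the function `concretize` are hypothetical. The sketch walks the plan backwards from the requested product, skips steps whose outputs already have a replica (make-style), and replaces logical input names with a physical location where one is known.

```python
# Hypothetical abstract plan, already in execution order:
# (derivation, logical inputs, logical outputs)
ABSTRACT_PLAN = [
    ("x1", ["file1"], ["file2"]),
    ("x2", ["file2", "cardfile"], ["file3"]),
]

# Hypothetical replica catalog: logical file name -> physical locations.
REPLICA_CATALOG = {
    "file1": ["gsiftp://site-a.example.org/data/file1"],
    "file2": ["gsiftp://site-b.example.org/data/file2"],
    "cardfile": ["file:///workdir/cardfile"],
}


def concretize(plan, wanted, catalog):
    """Keep only the steps needed for `wanted`, skipping materialized files."""
    needed, steps = {wanted}, []
    for name, inputs, outputs in reversed(plan):
        missing = [o for o in outputs if o in needed and o not in catalog]
        if not missing:
            continue  # nothing requested from this step, or replicas exist
        needed.update(inputs)
        steps.append({
            "run": name,
            "inputs": {i: catalog[i][0] for i in inputs if i in catalog},
            "outputs": outputs,
        })
    return list(reversed(steps))


for step in concretize(ABSTRACT_PLAN, "file3", REPLICA_CATALOG):
    print(step["run"], step["inputs"])
```

With the catalog above only `x2` is scheduled, because a replica of `file2` already exists; with an empty catalog both steps would run.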



3. DATA ANALYSIS IN HEP 

After a high energy physics detector is triggered, 
the information from the different systems is read and 
ultimately recorded (possibly after cleaning, filtering 
and initial reconstruction) to mass storage. The high 
intensity of the LHC beams usually results in more 
than one interaction taking place simultaneously, so a 
trigger records the combined response to all particles 
traversing the detector in the time window when the 
system is open. The first stages in the data process- 
ing are well defined and usually tightly controlled by 
the teams responsible for reconstruction, calibration, 
alignment, "official" simulation etc. The application 
of virtual data concepts in this area is discussed e.g. 
in [8].

In this contribution we are interested in the later
stages of data processing and analysis, when various
teams and individual scientists look at the data from
many different angles: refining algorithms, updating
calibrations or trying out new approaches, selecting
and analyzing a particular data set, estimating parameters
etc., and ultimately producing and publishing
physics results. Even in today's large collaborations
this is a decentralized, "chaotic" activity, and it is expected
to grow substantially in complexity and scale
for the LHC experiments. Clearly, what is needed are systems
flexible enough to accommodate a large user base and use
cases that cannot all be foreseen in advance.
Here we explore the benefits that a virtual
data system can bring in this vast and dynamic field.

Abstract and Concrete DAGs

Abstract DAXs (Virtual Data DAG):
abstract directed acyclic graph with
logical names for files/executables
(complete build-style recipe as DAX)

- Resource locations unspecified
- File names are logical
- Data destinations unspecified

Concrete DAGs (stuff for DAGMan):
Condor-style DAG for grid execution
(check RC, skip steps, make-style)

- Resource locations determined
- Physical file names specified
- Data delivered to and returned from physical
  locations

Figure 2: Steps in the derivation of a data product.

An important feature of analysis systems is the ability
to build scripts and/or executables "on the fly", including
user supplied code and parameters. The user
should be in a position to modify the inputs on her/his
desk(lap)top and request a derived data product, possibly
linking with preinstalled libraries on the execution
sites. A grid-type system can store large volumes
of data at geographically remote locations and provide
the necessary computing power for larger tasks. The
results are returned to the user or stored and published
from the remote site(s). An example of this
vision is shown in Figure 3. The Chimera system can
be used as a building block for a collaborative analysis
environment, providing "virtual data logbook" capabilities
and the ability to explore the metadata associated
with different data products.

To explore the use of virtual data in HEP analysis,
we take as a concrete (but greatly simplified) example
an analysis searching for the Higgs boson at the LHC.
The process begins with an analysis group that defines
a virtual data space for future use by its members. At
the start, this space is populated solely with virtual
data definitions, and contains no materialized data
products at all. A subgroup then decides to search
for Higgs candidates with mass around 160 GeV. It
selects candidates for Higgs decaying to W⁺W⁻ and
ZZ bosons, τ⁺τ⁻ leptons and bb̄ quarks. Then it concentrates
on the main decay channel H → W⁺W⁻.
To suppress the background, only events where both
Ws decay to leptons are selected for further processing.
Then the channel WW → eνμν is chosen as
having low background. At each stage in the analysis
interesting events can be visualized and plots for all
quantities of interest can be produced. Using the virtual
data system all steps can be recorded and stored
in the virtual data catalog. Let us assume that a new
member joins the group. It is quite easy to discover
exactly what has been done so far for a particular
decay channel, to validate how it was done, and to
refine the analysis. A scientist wanting to dig deeper
can add a new derived data branch by, for example,
applying a more sophisticated selection, and continuing
to investigate down the new branch (see Figure 4).
Of course, the results of the group can be shared easily
with other teams and individuals in the collaboration
working on similar topics, providing or re-using better
algorithms etc. New studies will
profit from the accumulated experience.
At publication time it will be much easier to
perform an accurate audit of the results, and to work
with internal referees who may require details of the
analysis or additional checks.

Complex Data Flow and Data Provenance in HEP

Figure 3: Example of a collaborative analysis environment.


4. PROTOTYPES 

In this section we describe the process of incorporating
Chimera in a prototype of a real analysis system,
and examine some of the issues that arise. In
this study we use events generated with PYTHIA (or
the CMS PYTHIA implementation in CMKIN), and analyze,
histogram and visualize them with the object-oriented
data analysis framework ROOT [9]. In the
basic Chimera implementation, the transformation is
a pre-existing program. Using the flexibility of the
virtual data system, we design our prototype with
"strong" data provenance by using additional steps in
the pipeline. The Concurrent Versions System (CVS)
is well suited to provide version control for rapid development
by a large team and to store, by the mechanism
of tagging releases, many versions so that they
can be extracted in exactly the same form even if files
have been modified, added or deleted since that time. In our design
we use wrappers (shell scripts) at all stages in order
to make the system more dynamic. In the first step we
provide a tag to the VDS and (using CVS) extract the
FORTRAN source code and the library version number
for the second step, the data cards for the third and the
C++ code for the last step. In the second step we compile
and link PYTHIA "on the fly", using the library
version as specified above. In the third step we generate
events with the executable and the data cards
from the first two steps. In the next, rather technical,
step, we convert the generated events, which are
stored in column-wise ntuples using FORTRAN calls to
HBOOK, to ROOT trees for analysis. In the final step
we execute a ROOT wrapper which takes as input the
C++ code to be run on the generated events and produces
histograms or event displays. After we define
the transformations and some derivations to produce
a given data product, the Chimera system takes care
of all dependencies, as shown in the DAG of Figure 5,
which is "assembled" and run automatically.

The Chimera configuration files describing our in- 
stalled transformations are presented below for refer- 
ence. 
Figure 4: Example of an analysis group exploring a virtual data space.



Transformation catalog (expects pre-built executables)

#pool  ltransformation  physical transformation       environment String
local  hw               /bin/echo                     null
local  pythcvs          /workdir/lhc-h-6-cvs          null
local  pythlin          /workdir/lhc-h-6-link         null
local  pythgen          /workdir/lhc-h-6-run          null
local  pythtree         /workdir/h2root.sh            null
local  pythview         /workdir/root.sh              null
local  GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3;VDS_HOME=/vdshome
local  globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
ufl    hw               /bin/echo                     null
ufl    GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3.1_04;VDS_HOME=/vdshome
ufl    globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib

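A catalog with the four columns listed above (pool, logical transformation, physical transformation, environment string) could be read with a few lines of Python. The sketch assumes whitespace-separated columns, a `#` comment header, `null` for an empty environment and `;` between `NAME=value` settings; these are inferred from the listing, not taken from a Chimera file format specification. The function names and the hyphenated header words in the sample are hypothetical.

```python
def parse_environment(text):
    """Turn 'A=1;B=2' into a dict; the literal 'null' means no settings."""
    if text == "null":
        return {}
    return dict(item.split("=", 1) for item in text.split(";") if item)


def parse_catalog(lines):
    """Map (pool, logical name) to (physical path, environment dict)."""
    catalog = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        pool, logical, physical, env = line.split(None, 3)
        catalog[(pool, logical)] = (physical, parse_environment(env))
    return catalog


SAMPLE = """\
#pool ltransformation physical-transformation environment-String
local pythgen /workdir/lhc-h-6-run null
ufl globus-url-copy /vdt/bin/globus-url-copy GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
"""

catalog = parse_catalog(SAMPLE.splitlines())
print(catalog[("ufl", "globus-url-copy")])
```

Keying the result on the (pool, logical name) pair reflects the fact that the same logical transformation, such as `globus-url-copy`, may be installed at several sites.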



In our implementation we use MySQL as the per- 
sistent store for the virtual data catalog. The trans- 
formations are executed using a Chimera tool called 
the shell planner, which permits rapid prototyping by 
VDL processing through execution on a local machine 
rather than on a full-scale grid. Alternatively, the system
can produce a DAG which can be submitted to a
local Condor [10] pool or to a grid scheduler.

In the last step of the illustrated derivation graph
we analyze the generated events using the rich set of
tools available in ROOT. Besides selecting interesting
events and plotting the variables describing them, we
develop a light-weight visualization in C++ based on
the ROOT classes. Using this tool the user can rotate
the event in 3D, produce 2D projections etc. Examples
are shown in Figures 6 and 7.



5. OUTLOOK 

We have developed a light-weight
Chimera/PYTHIA/ROOT prototype for building executables
"on the fly", generating events with PYTHIA
or CMKIN, analyzing, plotting and visualizing them
with ROOT. Our experience shows that Chimera is a
great integration tool. We are able to build a system
from components using CVS, MySQL, FORTRAN code
(PYTHIA) and C++ code (ROOT). The data provenance
is fully recorded and can be accessed, discovered and
reused at any time. The results reported here are
a snapshot of work in progress, which continues
to evolve both in Chimera capabilities and in their
application to CMS analysis.

Figure 5: Analysis prototype.

This work can be extended in several directions: 

• collaborative workflow management 

• automatic generation of derivation definitions 
from interactive ROOT sessions 

• an interactively searchable metadata catalog of 
virtual data information 



• more powerful abstractions for datasets, beyond
simple files

• control of all phases in the solving of multi-step
CPU intensive scientific problems (e.g.
the study of parton density function uncertainties [11])

• integration with the CLARENS system [12] for
remote data access

• integration with the ROOT/PROOF system for par- 
allel analysis of large data sets 

Interested readers can try out the prototype demo, 
which at the time of this writing is available at the 
following URL: 



grinhead.phys.ufl.edu/~bourilkov/pythdemo/pythchain.php

[1] ... International Conference on Scientific and Statistical
Database Management (SSDBM 2002), Edinburgh, 2002.

[2] P. Avery, I. Foster, "The GriPhyN Project: Towards
Petascale Virtual-Data Grids", GriPhyN
Technical Report 2001-14, 2001.

[3] T. Sjostrand et al., "High-energy-physics event
generation with PYTHIA 6.1", Comp. Phys. Commun.
135 (2001) 238.

[4] The CMS collaboration, "The Compact Muon
Solenoid - Technical Proposal", CERN/LHCC
94-38, CERN, Geneva, 1994.

[5] V. Innocente, L. Silvestris, D. Stickland, "CMS
Software Architecture: Software framework, Services
and Persistency in High Level Trigger,
Reconstruction and Analysis", CMS NOTE-2000/047,
CERN, 2000.

[6] http://cms-project-ccs.web.cern.ch/cms-project-ccs/

[7] J. Annis et al., "Applying Chimera Virtual Data
Concepts to Cluster Finding in the Sloan Digital
Sky Survey", Supercomputing 2003.

[8] A. Arbree et al., "Virtual Data in CMS Production",
TUAT011, these proceedings.

[9] http://root.cern.ch/Welcome.html

[10] http://www.cs.wisc.edu/condor

[11] D. Bourilkov, "Study of parton density function
uncertainties with LHAPDF and PYTHIA
at LHC", arXiv:hep-ph/0305126.

[12] http://clarens.sourceforge.net/



TUAT010







